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Abstract —We consider the problem of estimating the sparse 
time-varying parameter vectors of a point process model in an 
online fashion, where the observations and inputs respectively 
consist of binary and continuous time series. We construct a novel 
objective function by incorporating a forgetting factor mechanism 
into the point process log-likelihood to enforce adaptivity and 
employ -regularization to capture the sparsity. We provide a 
rigorous analysis of the maximizers of the objective function, 
which extends the guarantees of compressed sensing to our 
setting. We construct two recursive filters for online estimation 
of the parameter vectors based on proximal optimization tech¬ 
niques, as well as a novel filter for recursive computation of 
statistical confidence regions. Simulation studies reveal that our 
algorithms outperform several existing point process filters in 
terms of trackability, goodness-of-fit and mean square error. We 
finally apply our filtering algorithms to experimentally recorded 
spiking data from the ferret primary auditory cortex during 
attentive behavior in a click rate discrimination task. Our analysis 
provides new insights into the time-course of the spectrotemporal 
receptive field plasticity of the auditory neurons. 

Index Terms —Adaptive filtering; point process models; com¬ 
pressed sensing; neural signal processing; receptive field plastic¬ 
ity. 

I. Introduction 

Analyses of spiking activity recorded from sensory neurons 
have revealed three main features: first, neuronal activity is 
stochastic and exhibits significant variability across trials; 
second, the spiking statistics often undergo rapid changes 
referred to as neuronal plasticity, in order to adapt to changing 
stimulus salience and behavioral context; and third, the tun¬ 
ing characteristics of sensory neurons to the stimuli exhibit 
a degree of sparsity. Examples include place cells in the 
hippocampus ID and spectrotemporally tuned cells in the 
primary auditory cortex la. Hence, in order to gain insight 
into the functional mechanism of the underlying neural system, 
it is crucial to have a mathematical theory to simultaneously 
capture the stochasticity, dynamicity and sparsity of neuronal 
activity. 

On one hand, the theory of point processes 01 has been 
recently adopted as a mathematical framework to model the 
stochasticity of neuronal data. Traditionally, these models have 
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been used to predict the likelihood of self-exciting processes 
such as earthquake occurrences 01, 13, but have recently 
found significant applications in the analysis of neuronal data 

ii-iia. 

On the other hand, classic results in signal processing such 
as the Least Mean Squares (LMS) and Recursive Least Squares 
(RLS) algorithms CD have created a framework to efficiently 
capture the dynamics of the parameters in linear observation 
models. Existing solutions in computational neuroscience have 
adopted this framework to estimate the dynamics of neuronal 
activity. For instance, in 0 an LMS-type point process filter 
was introduced to study plasticity in hippocampal neurons. 
In EH, more general adaptive filtering solutions based on 
approximations to the Chapman-Kolmogorov equation were 
introduced. Although quite powerful in analyzing neuronal 
data, these solutions do not account for the sparsity of the 
underlying parameters. 

Finally, the theory of compressed sensing (CS) has provided 
a novel methodology for measuring and estimating statistical 
models governed by sparse underlying parameters 0 - 113 . 
In particular, for static linear and generalized linear models 
(GLM) with random covariates and sparsity of the parameters, 
the CS theory characterizes sharp trade-offs between the num¬ 
ber of measurement, sparsity, and estimation accuracy lfT3 . 
O- The sparse solutions of CS are typically achieved using 
batch-mode convex programs and greedy techniques. In online 
settings, sparse adaptive filters have only been introduced in 
the context of linear systems governed by sparse parameters 
such as communication channels O-ED- 

Despite significant progress in all these research fronts, a 
unified framework to simultaneously capture the stochasticity, 
dynamicity and sparsity of neuronal data is lacking. In this 
paper, we close this gap by integrating techniques from point 
process theory, adaptive filtering, and compressed sensing. To 
this end, we consider the problem of estimating time-varying 
stimulus modulation coefficients (e.g., receptive fields) from 
a sequence of binary observations in an online fashion. We 
model the spiking activity by a conditional Bernoulli point 
process, where the conditional intensity is a logistic function of 
the stimulus and its time lags. We will then design a novel ob¬ 
jective function by incorporating the forgetting factor mecha¬ 
nism of RLS-type algorithms into the -regularized maximum 
likelihood estimation of the point process parameters. We will 
present theoretical guarantees that extend those of CS theory 
and characterize fundamental trade-offs between the number 
of measurements, forgetting factor, model compressibility, and 
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estimation error of the underlying point processes in the non- 
asymptotic regime. We will next develop two adaptive filters 
for recursive estimation of the objective function based on 
proximal gradient techniques, as well as a filter for recursive 
computation of statistical confidence regions. 

In order to validate our algorithms, we provide simulation 
studies which reveal that the proposed adaptive filtering algo¬ 
rithms significantly outperform existing point process filters 
in terms of goodness-of-fit, mean square error and track- 
ability. We finally apply our proposed filters to multi-unit 
spiking data from ferret primary auditory cortex (Al) during 
passive stimulus presentation and during performance of a 
click rate discrimination task ll2^ in order to characterize 
the spectrotemporal receptive field (STRF) plasticity of Al 
neurons. Application of our algorithm to these data provides 
new insights into the time course of attention-driven STRF 
plasticity, with over 3 orders of magnitude increase in temporal 
resolution from minutes to centiseconds, while capturing the 
underlying sparsity in a robust fashion. Aside from their the¬ 
oretical significance, our results are particularly important in 
light of the recent technological advances in neural prostheses, 
which require real-time robust neuronal system identification 
from limited data. 

The rest of the paper is organized as follows: In Section 
HD we present our notational conventions, preliminaries and 
problem formulation. In Section [HD we introduce the main 
theoretical results of this paper, including the construction 
and stability analysis of the objective function, recursive filter 
development, and computation of confidence regions. Section 
HVl provides numerical simulations as well as application to 
real data, followed by our concluding remarks in Section |V] 
Technical details of Section [nD are presented in Appendices 
IAHC] 


II. Preliminaries and Problem Deeinition 


We first give a brief introduction to point process models 
(see 0 for a detailed treatment). We will use the following no¬ 
tation throughout the paper. Parameter vectors are denoted by 
bold-face greek letters. For example, uj = [cui, cu 2 , • • • , ^m]' 
denotes an M-dimensional parameter vector, with [•]' denoting 
the transpose operator. 

Consider a stochastic process defined by a sequence of 
discrete events at random points in time, noted hy t( = 
[fi, ^2, • • • ; ^nd a counting measure given by 

pt 

dN{t) ='y^S{t — tk), and N{t) = / dN{u), 

k=i -^0 


where S{.) is the Dirac’s measure. The Conditional Intensity 
Function (CIF) for this process, denoted by is defined 

as 


X(t\Ht) := lim 

e^O 


F {N{t + e) - N{t) = l\Ht) 


( 1 ) 


where Ht denotes the history of the process as well as the 
covariates up to time t. The CIF can be interpreted as the 
instantaneous rate given the history of the process and the 
covariates. A point process with a CIF given by is 

defined as: 


1) A^(0) = 0 

2) Given 0 = to < < ^2 < • • •, the random 

variables N{tk) — N{tk-i) are conditionally mutually 
independent. 

3) For any 0 < ti < t 2 , N{t 2 ) — N{ti) is a Poisson random 
variable with probability distribution 


p(iV(t2)-iV(ti) = /c) = 


X{t\Ht)dt^ ^ e- ■i'A' 


k\ 


A point process model is fully characterized by its CIF. 
For instance, X{t\Ht) = A corresponds to the homogenous 
Poission process with rate A. A discretized version of this 
process can be obtained by binning N(t) within an observation 
interval of [0, T] by bins of length A, that is 


nt := NitA) - iV((t - 1) A), t = 1, 2, • • • , T (2) 


where T := [T/A]. Throughout this paper, {nt}f=i will be 
considered as the observed spiking sequence, which will be 
used for estimation purposes. Also, by approximating Eq. O 
for small A 1, and defining At := X{tA\HtA), we have: 

P(nt = 0) = 1 - AtA -h o(A), 

P(rit = l) = AtA + o(A), (3) 

P(nt > 2) = o(A). 


In discrete time, the orderliness of the process is equivalent 
to the requirement that with high probability not more than 
one event fall into any given bin. In practice, this can always 
be achieved by choosing A small enough. An immediate 
consequence of Eq. ([3]) is that {nt}J^i can be approximated 
by a sequence of Bernoulli random variables with success 
probabilities {AtA}^]^. 

A popular class of models for the CIE is given by General¬ 
ized Linear Models (GEM). In its general form, a GEM con¬ 
sists of two main components: an observation model (which 
is given by © in this paper) and an equation expressing some 
(possibly nonlinear) function of the observation mean as a 
linear combination of the covariates. In neuronal systems, the 
covariates consist of extrinsic covariates (e.g., neural stimuli) 
as well as intrinsic covariates (e.g., the history of the process). 
In this paper, we only consider GEM models with purely 
extrinsic covariates, although most of our results can be 
generalized to incorporate intrinsic covariates as well. 

Let St denote the stimulus at time bin t, [6>o, • • • , Om- 2 ]' 

denote the vector of stimulus modulation parameters, and n 
denote the baseline firing rate. We adopt a logistic regression 
model for the CIE as follows: 


logit (At A) := log 


AtA 


1-AtA 


M-2 

d + OiSt-i (4) 

i=0 


By defining a; := [^, 6>o, 6>i, • • • , 0 m- 2]' and Xt := 

[1, St, • • • , St-M+2], we can equivalently write: 


At A = logit 


exp(a;^xt) 

1 -h exp(a;'xt) 


(5) 


The model above is also known as the logistic-link CIE 
model. Another popular model in the computational neu¬ 
roscience literature is the log-link model where AtA = 
exp(a;'xt). The significance of the logistic-link model is that 
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logit”^ (.) maps the real line (— 00 , + 00 ) to the unit probability 
interval ( 0 , 1 ), making it a feasible model for describing 
statistics of binary events independent of the scaling of the 
covariates and modulation parameters. 

Despite capturing the stimulus dependence in quite a general 
form, the GLM model in ([5]) represents a static model. We 
therefore generalize this model to the dynamic setting by 
allowing temporal variability of the modulation parameters: 


AtA = logit 


exp(cu^xt) 

1 + exp(a;'xt) 


( 6 ) 


where LCt := [/it, ..., 6 >t,M- 2 ]' represents the time- 

varying parameter vector at time t. Throughout the rest of the 
paper, we refer to x^ and ujt as the covariate vector and the 
modulation parameter vector at time t, respectively. 

In our applications of interest, the modulation parameter 
vector u: exhibits a degree of sparsity ll^ . (241. That is, only 
certain components in the stimulus modulation have significant 
contribution in determining the statistics of the process. These 
components can be thought of as the preferred or intrinsic 
tuning features of the underlying neuron. To be more precise, 
for a sparsity level L < M, we denote by 5” C {1,2, • • • , M} 
the support of the L highest elements of uj in absolute value, 
and hy the best L-term approximation to uj. We also define 


aL{uj) := ||u; - ujl\\i 


(7) 


to capture the compressibility of the parameter vector uj. 
Recall that for x G , the ^i-norm is defined as ||x||i : = 
\^i\- When aLiuj) = 0, the parameter uj is called L- 
sparse, and when aL{uj) is small compared to HculHi, the 
parameter is called L-compressible ll^ . 

Finally, the main estimation problem of this paper can 
be stated as follows: given binary observations and 

covariates {xt from a point process with a CIF given 

by Eq. m, the goal is to estimate the M-dimensional L- 
compressible parameter vectors {uJt}f=i in an online and 
stable fashion. 


^ windows of length W samples each, we assume that the 
CIF for each time point {k — 1)W +1 < f < kW is governed 
by ujt = uJk, for k = 1,2, ■ ■ ■ ,K. Note that number of spiking 
samples K is assumed to be an integer multiple of window 
size W, without loss of generality. 

Invoking the Bernoulli approximation to the Poisson statis¬ 
tics for A <?:; 1, the log-likelihood of the observation rit at 
time t can be expressed as: 

logp(nt) Ri nt log(AtA) + (1 - rit) log(l - At A) 

= nt(Xia;t) - log(l+exp(x[a;t)). ( 8 ) 

Assuming conditional independence of the spiking events, the 
joint log-likelihood of the observations within window k is 
given by: 

w 

k^ki^k) •= 'y^^'n'(^k-l)W-\-j^(k-l)W+i^k 

^ = 1 

- log (1 + exp(xj^_i)^^^.a;fe)) (9) 

In order to explicitly enforce adaptivity in the log-likelihood 
function, we adopt the forgetting factor mechanism of the 
RLS algorithm, where the log-likelihood of each window is 
exponentially weighted regressively in time, with a forgetting 
factor 0 < /3 < 1. That is, the effective data log-likelihood up 
to and including window k is taken to be: 

k 

Cf^{ujk) ■.= Y,(3'^-^Ci{uJk) ( 10 ) 

i=l 

for some 0 < /3 < 1. Note what for (3 = 1, C^{uJk) coincides 
with the natural data log-likelihood. Moreover, if we replace 
the Bernoulli log-likelihood with the Gaussian log-likelihood, 
then C^{uJk) coincides with the conventional RLS objective 
function. 

Next, in order to explicitly enforce sparsity, we adopt the 

-regularization mechanism of CS. That is, at window k, we 
seek an estimate of the form: 


III. Main Results 

In this section, we will first describe the construction of 
an appropriate objective function for addressing our main 
estimation problem. We will then present a rigorous analysis 
of the maximizers of the objective function, which extends 
the results of CS to our setting. Next, we will introduce two 
adaptive filters to recursively maximize the objective function 
based on proximal gradient techniques. Finally, we will outline 
how statistical confidence bounds can also be constructed in 
a recursive fashion for our estimates. 

A. £i-regularized Exponentially Weighted Maximum Likeli¬ 
hood (ML) 

Before proceeding with the construction of the objective 
function, we need to introduce more notational conventions. In 
order to have a framework allowing multi-timescale dynamics, 
we consider piece-wise constant dynamics for the parameter 
ujt. That is, we assume that uJt remains constant over windows 
of arbitrary length W > 1 samples, for some integer VF. By 
segmenting the corresponding spiking data into K : = 


= argmax - 7ll^fe||i} (H) 

<^k 

where 7 is a regularization parameter controlling the trade off 
between the log-likelihood fit and the sparsity of estimated 
parameters. Our theoretical analysis in the next subsection 
reveals appropriate choices for 7 , /3 and the trade-offs therein. 

B. Stability Analysis of the Objective Function 

In order to quantify the trade-offs involving our choice of 
the objective function in Eq. (HB, we proceed in the tradition 
of performance analysis result of the RLS algorithm m by 
characterizing the geometric properties of the estimates uJk in a 
stationary environment where uJk = uJ for all k. Our analysis, 
however, is quite general and avoids ad hoc assumptions 
such as direct averaging or covariate independence which are 
usually invoked in the analysis of least squares problems. 

We need to make the following technical assumptions for 
our analysis: 

1) The stimulus sequence {st}f^_M+i consists of in¬ 
dependent (but not necessarily identically distributed) ran¬ 
dom variables with a variance of which are uniformly 
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bounded by a constant 5 > 0 in absolute value. Note 
that with this assumption, two successive covariate vec¬ 
tors, say at times t and t + 1, given respectively by 
Xt = [1, St-M+2, St-M+3, St-M+4:, ' ' ' , St] and Xt_|_i = 

[l,St-M+s,St-M+4:,-■ ■ ,st,st+i] are highly dependent, as 
they have M — 3 random variables in common. Hence, 
the independent assumption used in studying least squares 
problem is violated. 

2) We further assume that for all times t, 0 < pmin < 
At A < pmax < 1, for some constants pmin and pmax, i-e., the 
probability of spiking does not reach its extremal values of 0 
and 1, but can get arbitrarily close. This assumption can be 
realized due to the boundedness of the covariates and appro¬ 
priate normalization of the stimulus modulation coefficients, 
and does not result in any practical loss of generality. 

We have the following theoretical result regarding the 
stability of the maximizers of the objective function: 


Theorem 1: Suppose that binary observations from a point 
process with a CIF given by Eq. are given over K windows 
of length W each. Consider a stationary environment with 
LOk = (jO for all k and suppose that uj is L-compressible. Then, 
under assumptions (1) and (2), for a fixed positive constant 
d > 0, there exist constants C, C, and C" such that for 


1 - 


C 


< (3 < I, K > 


and a choice of 'y > 


L'^ log M ^ H log(l) 

/ logM ^ jo/nfion UJ to ([77]) satisfies the bound 


\UJ — UJ\ 


, < C^J{l-(3)L logM+ VC'ctl(u;) ^{l-(3)L logM, 


with probability at least 1 — The constants C, C, and 
C" are only functions of d, Pmin, Pmax. cr^. B, and W, and 
are explicitly given in Appendix^ 


Proof: The proof is given in Appendix [^ ■ 

Remarks. The result of Theorem [T] has four major implica¬ 
tions. First, the error bound scales with s/L, the sparsity level, 
as opposed to M, the ambient dimension of the parameter 
vector, which is consistent with results from CS, and results in 
the robustness of the estimate when the underlying parameter 
is sparse. Note that the bounds holds for general non-sparse uj, 
but is sharpest when aL{uj) is negligible, i.e., the parameter 
vector is nearly L-sparse. 

Second, the theorem prescribes a lower bound on the 
forgetting factor akin to the bounds obtained in CS theory 
for the total number of observations. For instance, the result of 
ll26t for CS under Toeplitz sensing measurements for the linear 
model requires T = 0{L‘^ logM) number of measurements to 
achieve a similar scaling of the error bound. In our case, the 
role of the number of measurements is transferred to forgetting 
factor by taking as the effective length of the measure¬ 
ments. In the absence of the forgetting factor (/3 = 1), by a 
careful limiting process, our results require T = 0{L‘^ logM) 
measurements. The latter case can be compared to the result of 
im for point process models with independent and identically 
distributed covariate vectors, which requires 0{L\ogM) for 
stability. The loss of 0{L) is incurred due to the shift structure 
and hence high dependence of the covariate vectors in our 
case, as exemplified in assumption (1). 


Third, the theorem reveals the scaling of the regularization 
parameter in terms of M and fi. In particular, this scaling is 
significant as it reveals another role for the forgetting factor 
mechanism: not only the forgetting factor mechanism allows 
for adaptivity of the estimates, it also controls the scaling of 
the -regularization term with respect to the log-likelihood 
term. Fourth, unlike conventional results in the analysis of 
adaptive filters which concern the expectation of the error in 
the asymptotic regime, our result holds for a single realization 
with probability polynomially approaching 1 , in the non- 
asymptotic regime. 

Note that the objective function is clearly concave, and 
assuming that the matrix of the covariate vectors is full-rank, 
will be strictly concave with a unique maximizer. However, 
the result of Theorem [T] does not require the uniqueness of 
the maximizer and holds for any maximizer of the objective 
function. In the next section, we will proceed with the de¬ 
velopment of recursive filters to track the maximizer of the 
objective function in the more general time-varying setting. 

C. Algorithm Development 

Several standard optimization techniques, such as interior 
point methods, can be used to find the maximizer of (HI])- 
However, most of these techniques operate in batch mode and 
do not meet the real-time requirements of the adaptive filtering 
setting where the observations arrive in a streaming fashion. In 
order to avoid the increasing runtime complexity and memory 
requirements of the batch-mode computation, we seek a recur¬ 
sive approach which can perform low-complexity updates in 
an online fashion upon the arrival of new data in order to form 
the estimates. To this end, we adopt the proximal gradient 
approach. A version of the proximal gradient algorithm is 
given in Appendix |B] Each iteration of the algorithm moves 
the previous iterate along the gradient of the log-likelihood 
function, which will then pass through a shrinkage operator. 

Before describing further details, we introduce 

a more compact notation for convenience. Let 

nfc := [n(fc_i)w+i, ?^(fe-i)vK-ri 5 • • • 5 denote the vector 

of observed spikes within window k, for k = 1,2,..., K. 
Similarly, let := [A(/c_i)w+i, • • •; 

denote the vector of CIFs within window k. By extending 
the domain of the logit ~^(-) to vectors in a component-wise 
fashion, we can express A^A as: 

AfcA = logif^ ( 12 ) 

where := [jC(^k-i)w+i,^{k-i)w+ 2 , ■ ■ ■ ,^kw]' is the data 
matrix of size WxM with rows corresponding to the covariate 
vectors in window k. Suppose that at window k, we have an 
iterate denoted by ujj^f for £ = 0,1, • • • , R, with R being an 
integer denoting the total number of iterations. The gradient 
of C^f) evaluated at can be written as: 

(Sf) =: g, (d<'>) (13) 

1=1 

where Sif) := rii — Ai(-)A represents the innovation vector 
of the point process at window i. The innovation vector Si can 
be thought of as the counterpart of the conventional innovation 
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vector in adaptive filtering of linear models. The proximal 
gradient iteration for the -regularization can be written in 
the compact form as: 

+ «gfc {^k^) ) (14) 

where Sr{-) is the element-wise soft thresholding operator at a 
level of T given in Appendix |B] The final estimate at window 
k is obtained following the i?th iteration, and is denoted by 
cDfc := . In order to achieve a recursive updating rule for 

gfc, we can rewrite Eq. O as: 



However, in an adaptive setting, we only have access to values 
of gfc-i evaluated at In order to turn Eq. O into 

a fully recursive updating rule, all the previous CIE vectors 
should be recalculated at the most recent set of 
iterates In order to overcome this computational 

burden, we exploit the smoothness of the logistic function and 
employ the Taylor series expansion of the CIE to approximate 
the required recursive update. In what follows, we consider 
the zeroth order and first order expansions, which result in 
two distinct, yet fully recursive, updating rules for Eq. (Us])- 


Zeroth Order Expansion: By retaining only the first term in 

around cDj, 

(16) 


the Taylor series expansion of the CIE Aj ( uj 


we get: 


Xi ) A se Aj [uJi] A 


) 

where AiA = logit ^(Xio)*). Substituting this approximation 
in Eq. (flSt . we can express the zeroth order approximation to 
the gradient at window k, denoted by g^(-), as: 

g2(4'’)=E'5"“‘X'£.(ai.) (17) 

i=l 

It is then straightforward to obtain a recursive form as: 



The shrinkage step will be then given by: 


(jj 


(^+ 1 ) 


5. 


7 a 



+ «gfc 



(19) 


First Order Expansion: If instead, we retain the first two 
terms in the Taylor expansion, Eq. (fT^ will be replaced by: 

A, A ^ A, {Qi) A + A, {Q,) (20) 

where Ai(a)j) is a diagonal W xW matrix with the (m, m)th 
diagonal element given by A(i_i)w+mA(l - A(i_i)w+mA). 
Using the first order approximation above, we can improve the 

resulting approximation to the gradient, denoted by g^, as: 

k 

el (sf) = X'(e.(D,) - A,(S,)X.(2f - 2,)) 

i=l 

By defining: ^ 

i=l 

k 

^k'’ :=5^/3"-'X'A,(cD0Xi, (22) 

we can express g^ ( 

gj(sr) 

It is then straightforward to check that both and can 
be updated recursively mu as: 

+ Xt(et (2if>( + Ak ( 25 ^'*) Xi2if’) 

Bl'’=/JBS + XiAt(2<'')Xt 

(i) (i) 

Note that the update rules for both ' and ^ involve 
simple rank-VE operations. The shrinkage step is then given 

( (l - (24) 

We refer to the resulting filter as the -regularized Point 
Process Eilter of the Eirst Order (£i-PPEi). A pseudo-code 
is given in Algorithmic 

Remark. The computational complexity of £i-PPEo and ii- 
PPEi algorithms can be shown to be linear and quadratic 
in M per iteration, respectively. Our results in Section IIVI 
will reveal that both filters outperform existing filters of the 
same complexity, respectively. Eurthermore, £i-PPEi exhibits 
superior performance over ^i-PPEq as expected, although with 
a cost of 0{M) in computational complexity per iteration. 


We refer to the resulting filter as the -regularized Point 
Process Eilter of the Zeroth Order (£i-PPEo). A pseudo-code 
is given in Algorithm [T] 


Algorithm 1 -regularized Point Process Filter of the Zeroth Order 

(^i-PPFq) 

Inputs: n/c, Xfc, gfc-i, and R . 

1 : for i = 0,..., R — 1 do 
2: AfcA = logif^ 

3: = life - AfcA 

4: gfc = /3 gfc-i + XfcSfc 

5: + agk] 

6: end for 
Output: u)fc := 


Algorithm 2 -regularized Point Process Filter of the First Order 

(^i-PPEi) 

Inputs: rifc, Xfc, Ufc_i, Bfc_i, and R . 

1 : for i = 0,..., R — 1 do 
2: AfcA = logif^ (^XfccD^f^^ 

3: Sfc = nfc - AfcA 

4: {■X-k)m,m — {Xk')m^ (1 (Afc)mA), 771 = 1, • • • , W 

5: Ufc = /3 Ufc_i -f Xj^ ^Sfc -|- AfcXfca;i, 

6 : Bfc =/3 Bfc_i-h X'^AfcXfc 

7: gfc = Ufc - Bfcuijf^ 

8: [tlif + agfc] 

9: end for 
Output: cDfc := 
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D. Constructing Confidence Intervals 

Characterizing the statistical confidence bounds for the 
estimates is of utmost importance in neural data analysis, as 
it allows to test the validity of various hypotheses. Although 
construction of confidence bounds for linear models in the 
absence of regularization is well understood and widely ap¬ 
plied, regularized ML estimates are usually deemed as point 
estimates for which the construction of statistical confidence 
regions is not straightforward. A series of remarkable results in 
high-dimensional statistics have recently addressed 

this issue by providing techniques to construct confidence 
intervals for -regularized ML estimates of linear models as 
well as GLMs. These approaches are based on a careful in¬ 
spection of the Karush-Kuhn-Tucker (KKT) conditions for the 
regularized estimates, which admits a procedure to decompose 
the estimates into a bias term plus an asymptotically Gaussian 
term (referred to as ’de-sparsifying’ in lUHl), which can be 
computed using a nodewise regression ll^ of the covariates. 

In what follows, we give a brief description of how the 
methods of ll28l apply to our setting, and leave the details to 
Appendix ICl Using the result of li28l . the estimate cPfc as the 
maximizer of (HB can be decomposed as: 

= 0fcgfc(u)fc) + Wfc (25) 

where ©fc is an approximate inverse to the Hessian of C^{<jo) 
evaluated at gfc is the gradient of previously 

defined in Eq. (fTsT l. and Wfc is an unbiased and asymptot¬ 
ically Gaussian random vector with a covariance matrix of 
cov(wfc) = ©fcG/c(u)/c)©fe, with 
k 

Gk{Qk) := 5^/32(^-*)x'si(£Dfc)£,(cPfc)'X,. (26) 

i=l 

The first term in Eq. (1251 ) is a bias term which can be directly 
computed given ©fe. Given cov(wfc), statistical confidence 
bounds for the second term at desired levels can be constructed 


Algorithm 3 Recursive Construction of the Confidence Re¬ 
gions for the mth Component of w/c0 

I -(0) 

Inputs: n/c, Xfe, Ufc, Bfc, and uik, Gk-i, m, 7 ^, and . 

1 - gfc — life BfeCPfe 

2: Gfc = /32Gfc-i+X;£fc4Xfc 
3: for i = 0,..., R — 1 do 

5: end for 

^ z p\ 

'^m ~ (B/c)m,m ~ 'firn (®fc)m,\m 
7: (c)^ = 1, and (c)\^ = 

8. (0/c)m ~ 

^k,m •“ {®k)mGk{^k)m 
10 : (w/j)^ (®fe)mg/c 

Output: C7^fc,m := [{^k)m ± $“^1 “ o:/2)ak,m] 


"For a matrix A £ , we denote by the mth row with 

the mth element removed, and by the submatrix of A with both 

the mth row and column removed. 


in a standard way. The main technical issue in the aforemen¬ 
tioned procedure in our setting is the computation of ©^ in a 
recursive fashion. Since the rows of ©^ are computed using 

-regularized least squares, we use the SPARES algorithm 
Ga as an efficient method to carry out the computation in 
a recursive fashion. Algorithm 0 summarized the recursive 
computation of confidence intervals for the mth component 
of Wfc. 

IV. Applications 

In this section, we will apply the proposed algorithms to 
simulated data as well as real spiking data from the ferret 
primary auditory cortex. In our simulation studies, we compare 
the performance of our proposed filters with two of the state- 
of-the-art point process filters, namely the steepest descent 
point process filter (SDPPE) ||7l and the stochastic state point 
process filter (SSPPE) lfl4l . These adaptive filters are based 
on approximate solutions to the Chapman-Kolmogorov for¬ 
ward equation obtained by a steepest descent and a Gaussian 
approximation procedure, respectively. 

A. Simulation Study I: MSE and Sparse Recovery Learning 
Curves 

Eirst, we consider a stationary environment where uj is con¬ 
stant over time. We use a bin size of A = 1 ms and window 
size of VL = 1 sample, for a total observation window of 
T = 30 sec (K = 30000). The length of the parameter vector 
uj = [p,0] is chosen as M = 101. Eor each realization, we 
draw a sparse parameter vector 9 of fixed length M — 1 = 100 
and sparsity L = 3. The support S and values of the nonzero 
components of 9 are chosen randomly and the values are 
normali z ed so that ||0||2 = 10. The stimulus input sequence 
{sk}k=-M+i drawn from an i.i.d. Gaussian distribution 
A/’(0,cr^). The binary spike train {nk}^^i is generated as a 
single realization of conditionally independent Bernoulli trials 
with success rate A^A. The stimulus variance cr^ is chosen 
as = 0.01 small enough so that the average spiking rate 
Xavg^ = 0.13 1 to ensure that the Bernoulli approximation 

is valid. All the simulations are done with R = 1 iteration per 
time step. The step size a is chosen as a ~ 9 x 10“^ (See 
Appendix IB] for details). 

Eor a given pair of (a, fi) parameters, we select an optimal 
value for the regularization parameter 7 by performing a two¬ 
fold even-odd cross validation procedure: first, the data are 
split into two sets of even and odd samples in an interleaved 
manner. Then, one set is used as the training set for estimation 
of the parameter vector ojk and the other is used to assess the 
goodness-of-fit of the estimates cDfc with respect to the log- 
likelihood of the observations. We repeat the process switching 
the role of the two sets and take the average as the overall 
measure of fit. 

Let E denote the averaging operator with respect to realiza¬ 
tions. We consider two performance metrics: the normalized 
mean squared error (MSE) defined as MSE/. := E||a)/c — 
a;/c|p/E||a;A:|p to evaluate MSE performance at time step fc; 
and the out-of-support energy defined as SPM/c := E\\9k — 
{9k)s\\‘^/^\\9k\\‘^ to represent a sparsity metric (SPM). Ideally, 
SPM/c must be equal to zero at all times. The averaging is 
carried out over a sufficiently large number of runs. 
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Time (s) 

Fig. 1. Learning curves of the adaptive filtering algorithms in a stationary 
environment. A) MSE vs. time, B) SPM vs. time. 

Figure [T] shows the corresponding learning curves for 
the four algorithms. According to Figure [T]-A, the ^i-PPFi 
achieves the lowest stationary MSE measure of —10.4 dB, 
followed £i-PPFo which achieves an MSE of —9.2 dB. The 
SSPPE and SDPPE algorithms respectively achieve an MSE 
of —3.35 dB and —1.9 dB, which reveals a gap of se 7 dB 
with respect to our proposed filters. Note that this result is 
consistent with the prediction of Theorem [T] and highlights 
the MSE gain achieved by -regularization, as opposed to 
ML, when the underlying parameter is sparse. 

B. Simulation Study 2: Tracking and Goodness-of-fit Perfor¬ 
mance 

In the second simulation scenario, we consider a more 
realistic setting where oJk evolves in time. Eurthermore, as in 
the case of real data applications, we assume that the support 
of uik is not available as a performance benchmark and resort 
to statistical goodness-of-fit test. These tests for point process 
models have been developed as an application of the time¬ 
rescaling theorem ED, 01 and consist of the Kolmogorov- 
Smirnov (KS) test for assessing the conditional intensity 
estimation accuracy, and the Autocovariance Eunction (ACE) 
test to assess the conditional independence assumption. We 
skip the details in the interest of space, and refer the readers 
to the aforementioned references for a detailed treatment. 

As in the previous case, we consider a bin size of A = 1ms, 
window size of VF = 1 , and a total observation window of 
T = fiOsec {K = 60000 bins). The stimulus is generated as 
in the previous case. Eor the parameter vector ujk, we choose 
a fixed baseline rate of Hk = —2.51 to set the baseline spiking 
rate to XavgkS 0 . 1 , and select a sparse modulation vector 
9k of length M = 100 with a support S = {1,10, 20} of size 
L = 3, and respective values of {9k){i ,10,20} = {10,-5, 5} 
for k < Kl2. Halfway through the test, at k = A'/2-|-l, 
the largest component {6k)i, drops rapidly and linearly to 0 , 
within a window of length 1 sec and remains zero for the rest 
of the run. 

Eigure [2] shows the performance of all four algorithms in 
the aforementioned setting. Each row (A through D) shows the 
true time-varying parameter vector (dashed traces) as well as 
the filtered estimates (solid traces) in the left panel. In partic¬ 
ular, the gray solid traces show the out-of-support components 
which must ideally be equal to zero. The colored hulls around 
{6k)i show the 95% confidence intervals (note that confidence 


intervals for SDPPE cannot be directly obtained and require 
averaging over multiple realizations). The middle and right 
panels show the KS and ACE test results at a 95% confidence, 
respectively. Eor the quadratic algorithms ^i-PPEi and SSPPE, 
a forgetting factor of /3 = 0.9995 is chosen. The regularization 
parameter for £i-PPEi is chosen as 7 = 0.5, obtained by the 
aforementioned two-fold even-odd cross validation. Eor the 
first order algorithm ^i-PPEq, a smaller forgetting factor of 
(3 = 0.995 is chosen to ensure stability, and a value of 7 = 0.1 
is used based on cross validation. These settings ensure that 
all the algorithms are tuned in their optimal operating point 
for fairness of comparison. 

Eigure [2]-A and [2]-B reveal three striking performance 
gaps between the two second-order algorithms (with the same 
computational complexity, quadratic in M): first, the out-of- 
support components (gray traces) of £i-PPEi are significantly 
smaller than those of SSPPE; second, the confidence regions 
of £i-PPEi are narrower than those of SSPPE; and third, li- 
PPEi fully passes the KS test, while SSPPE marginally does 
so. Similarly, comparing the two first order algorithms (with 
the same computational complexity, linear in M) Eigure [2]-C 
and[2]-D reveal that the £i-PPEo significantly suppresses the 
out-of-support components as compared to SDPPE. Moreover, 
^i-PPEo provides confidence bounds, which cannot be directly 
obtained for SDPPE. Einally, £i-PPEo marginally fails the KS 
test, whereas SDPPE does so significantly. Both algorithms fail 
the ACE test, which shows that the second-order corrections 
embedded in £i-PPEi and SSPPE is necessary to achieve a 
better goodness-of-fit, which a price of higher computational 
complexity. 

We also inspect the estimated firing probability \k{u}k)^ 
for the four algorithms in Eigure [3] In addition, we include 
the probability estimated by the normalized reverse correlation 
(NRC) method, which is commonly used in neural data 
analysis, and fits the modulation parameters using a linear 
model. Eigure [2 shows the true spiking probability (blue 
solid trace) and the resulting spikes (black vertical lines). In 
the subsequent rows (B through E), the true and estimated 
probabilities are shown by dashed blue and solid red traces, 
respectively. A comparison of all the rows reveals that £i-PPEi 
and £i-PPEo outperform SSPPE and SDPPE, respectively, in 
terms of estimating the true probability. The NRC method 
is inferior to the preceding four algorithms, and results in 
negative estimates of the probability due to its use of a linear 
model (as opposed to logistic). 

C. Application to Real Data: Dynamic Analysis of Spec- 
trotemporal Receptive Field Plasticity 

The responses of neurons in the primary auditory cortex 
(Al) can be characterized by their spectrotemporal receptive 
fields (STREs), where each neuron is tuned to a specific region 
in the time-frequency plane, and only significantly spikes when 
the acoustic stimulus contains spectrotemporal contents match¬ 
ing its tuning region El (See EigurelH top row, leftmost panel). 
Several experimental studies have revealed that receptive fields 
undergo rapid changes in their characteristics during attentive 
behavior in order to capture salient stimulus modulations ll 2 D . 
ES, 031. In ||2D . it is suggested that this rapid plasticity 
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Fig. 2. Performance comparison of the adaptive filtering algorithms: A) £i-PPFi, B) SSPPF, C) ^i-PPFq, and D) SDPPF In each row, the left panel shows 
the trie parameter vector with dashed traces and the estimates with solid traces. Colored hulls show the 95% confidence intervals for one of the components. 
The middle and right panels show the corresponding KS and ACF test plots, respectively. Red traces show confidence regions at a level of 95% for both tests. 
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Fig. 3. Firing rate estimates for adaptive filtering algorithms within an 
interval of 34.2 s < t < 34.4 s: A) tme rate (blue solid trace) and spikes 
(black vertical lines), B) £i-PPFi, C) SSPPF, D) £i-PPFo, E) SDPPF, and 
F) normalized reverse correlation (NRC). In rows B through F, the dashed 
blue traces and solid red traces show the true rate and the estimated rate, 
respectively. 


has a significant role in the functional processes underlying 
active listening. However, most of the widely-used estimation 
techniques (e.g., normalized reverse correlation) provide static 
estimates of the receptive field with a a temporal resolution 
of the order of minutes. Moreover, they do not systematically 
capture the inherent sparsity manifested in the receptive field 
characteristics. 

In the context of our model, the STRF can be modeled as an 
(/ X J)-dimensional matrix, where I and J denote the number 
of time lags and frequency bands, respectively. By vectorizing 
this matrix, we obtain an (M — 1)-dimensional vector 9k at 
window k, where M = I x J+1. Augmenting the baseline rate 
parameter Hk, we can model the activity of the A1 neurons 
using the logistic GIF with a parameter ujk := 

The stimulus vector at time t, St is given by the vectorized 
version of the spectrogram of the acoustic stimulus with J 
frequency bands and I lags. In order to capture the sparsity of 
the STRF in the time-frequency plane, we further represent 6k 
over a Gabor time-frequency dictionary consisting of Gaussian 
windows centered around a regular subset of the / x J time- 
frequency plane. That is, for 6k = where F is the 

dictionary matrix and is the sparse representation of the 
STRF. The estimation procedures of this paper can be applied 
to ^k^ by absorbing the dictionary matrix into the data matrix 
Xfe at window k. 
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Fig. 4. The time-course of task-dependent STRF plasticity of a ferret A1 neuron. The top row shows snapshots of the STRF at five selected points in time, 
marked by the dashed vertical lines in the bottom graph. The bottom graph shows the time-course of five selected points (A through E) in the STRF marked 
on the leftmost panel of the top row. 


We apply our proposed adaptive filter £ i-PPFi to multi-unit 
spike recordings from the ferrets A1 during a series of passive 
listening conditions and active auditory task conditions. Dur¬ 
ing each active task, ferrets attended to the temporal dynamics 
of the sounds, and discriminated the rate of acoustic clicks 
||3^ . The STRFs were estimated from the passive condition, 
where the quiescent animal listened to a series of broadband 
noise-like acoustic stimuli known as Temporally Orthogonal 
Ripple Combinations (TORC). The experiment consisted of 2 
active and 11 passive blocks. Within each passive block, 30 
TORCs were randomly repeated a total of 4-5 times each. In 
our analysis, we pool the spiking data corresponding to the 
same repeated TORC within each block. Therefore, the time 
axis corresponds to the experiment time modulo repetitions 
within each block. We discretize the resulting duration of 
T = 990s to time bins of size A = 1 ms, and segment 
data to windows of size VP = 10 samples (10 ms). The 
STRF dimensions are 50 x 50, regularly spanning lags of 
1 to 50 ms and frequency bands of 0.5 kHz to 16 kHz 
(in logarithmic scale). The dictionary F consists of 13 x 13 
Gabor atoms, evenly spaced within the STRF domain. Each 
atom is a two-dimensional Gaussian kernel with a variance of 
11^/4 per dimension, where D denotes the spacing between 
the atoms. We selected a forgetting factor of /3 = 0.9998, a 
step size of a = where is the average variance of 

the spectrogram components, R = 1 iterations per sample, and 
a regularization parameter of 7 = 40 via two-fold even-odd 
cross validation. 

Figure 01 top row, depicts five snapshots taken at 
{180,360,540, 630, 990} sec corresponding to the end-points 
of the {2,4, 6 , 7, ll}th passive tasks. The bottom row shows 
the time-course of five selected points (marked as A through 
D in the leftmost panel of the top row) of the STRF during 
the experiment. The STRF snapshots at times 180 and 540 sec 


correspond to 90 secs after the two active tasks, respectively, 
and verify the sharpening effect of the excitatory region 
(^ 30 msec, 8 kHz) due to the animal’s attentive behavior 
following the active task reported in ll22l . Moreover, the STRF 
snapshots at times 360 and 630 sec reveal the weakening of 
the excitatory region long after the active task and returning to 
the pre-active state, highlighting the plasticity of A1 neurons. 
Previous studies have revealed the STRF dynamics with a 
resolution of the order of minutes ||34|. Our result in Figure 
m provides a temporal resolution of the order of centiseconds 
(3 orders of magnitude increase), while capturing the STRF 
sparsity in a robust fashion. 

V. Concluding Remarks 

In this paper, we considered recursive estimation of the 
time-varying parameter vectors in a logistic regression model 
for binary time series driven by continuous input. To this end, 
we integrated several techniques from compressed sensing, 
adaptive filtering, optimization and statistics. We constructed 
an objective function which enjoys from the trackability 
features of the RLS-type algorithms, sparsifying features 
of £ 1 -minimization, and unlike the rate-based linear models 
commonly used to analyze spiking data, takes into account 
the binary statistics of the observations. We analyzed the 
maximizers of the objective function in a rigorous fashion, 
revealing novel trade-offs between various model parameters. 
We constructed two adaptive filters, with respective linear and 
quadratic complexity requirements, for recursive maximi z ation 
of the objective function in an online setting. Moreover, 
we characterized the statistical confidence regions for our 
estimates, and devised a recursive procedure to compute them 
efficiently. 

Although we specialized our treatment to logistic statistics 
and -regularization, our approach to algorithm development 
has a plug-and-play feature: other GLM link functions (e.g.. 
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log-link) with possibly history dependent covariates and other 
regularization schemes (e.g., re-weighted £i, or group-sparse 
regularization) can be used instead and result in a large class of 
adaptive filters for sparse point process regression. We tested 
the performance of our algorithms on simulated as well as 
experimentally recorded spiking data. Our simulation studies 
revealed that the proposed filters outperform several existing 
point process filters. Application of our filters to real data from 
the ferret primary auditory cortex provided a high-resolution 
characterization of the time-course of spectrotemporal recep¬ 
tive field plasticity, with 3 orders of magnitudes increase 
in temporal resolution. Although we focused on auditory 
neurons, we expect a similar superior performance of our 
filters when applied to other sensory or motor neurons (e.g., 
cells in primary or supplementary motor cortex (351). 

Appendix A 
Proof of Theorem [T] 

The proof is mainly based on the beautiful treatment of 
Negahban et al. (HI. The major difficulty in our case lies 
in the high inter-dependence of the covariates, which form 
a Toeplitz structure due to the setup of adaptive filtering. 
We address the latter issue by adopting techniques from 
another remarkable paper by Haupt et al. (261 to deal with the 
underlying interdependence. In the process, we also employ 
concentration inequalities for dependent random variables due 
to van de Geer (3^ . 

In order to proceed, we adopt the notion of Strong Restricted 
Convexity (RSC) introduced in im. For a twice differentiable 
log-likelihood with respect to a;, the RSC property or order 
L implies the existence of a lower quadratic bound on the 
negative log-likelihood: 

:= -Cf^{uj + A)+C^{uj) + AVC^{uj) > ac||A||2, 

for a positive constant k > 0 and all A e satisfying: 

II A5c||i < 3 IIA 5 II 1 -h 40 - 5 ( 0 ;). (27) 

for any index set S C {1, 2, • • • , M} of cardinality L. The 
following key lemma establishes the RSC for 

Lemma 1: Let denote a sequence of covariates 

and let ui denote the corresponding logistic parameters. Then, 
for a fixed positive constant d > 0, there exist constants C 
and K > 0 such that for f3 > 1 — ^ and K > ^ 

the negative log-likelihood satisfies the RSC of order 

L with constant with probability greater than 1 — 

The constants C and n are only functions of d, Pmin, Pmax, 
B, W, and are explicitly given in the proof 

Proof: The proof is inspired by the elegant treatment of 
Negahban et al. Ga. The major difficulty in our setting is 
the high interdependence of successive covariates due to the 
shift structure induced by the adaptive setting, whereas in (Tsl, 
the matrix of covariates is composed of i.i.d. rows. Using the 
Taylor’s theorem, T>c{A^uj) can be written as: 

l+exp 


K W 

t=i j=i 


K 


_ exp (; 


with CO* = o; -|- tA for some r G (0,1). Since by hypothesis 
0 < Pmin < AiA < Pmax < 1, wc havc: 

exp > ^ . n - « ) 129) 

y 7 \ \ 2 ^ PmintJ- Pmax/- ) 

(1+exP 

We can therefore further lower bound 22/:(A, co) by: 

I2/:(A, Co) > Pmm(l - Pm^^WNp { A'C /3 A} , (30) 

-I_ oK-\-l 

where := , and 

K w 

c3-;4-EE (31) 

p i=i j=i 

Note that the matrix C/j has highly inter-dependent elements 
due to the Toeplitz structure in the adaptive design. In order to 
establish the RSC condition, we show the stronger Restricted 
Eigenvalue (RE) property, which in turn implies RSC (371. Let 
6 G (0,1) be fixed. To do so, we need to bound the eigenvalues 
of (C/ 3 ) 5 , the restriction of to a subset of columns and 
rows corresponding to indices in S C with 

I S'! = rL, for some integer r > 1 -|- 
Without loss of generality, we assume that the first entry 
of the CO variate vectors X/ is replaced by a instead of 1, 
for presentational simplicity of the following treatment. Eor 
m, m' / 1, we have: 

K w-i 

EE 

i=l j=0 






For m = m' = 1, 






1 


i=l 



1 - 

1-/3 


1 , 


and for m 1, 

K w-i 

EE 

i=l j=0 


= (C^)i,m = 


aN^ 


We also have = <3mm'- Using Hoeffding’s 

inequality (^ we get: 


P (|(C^)m,m 


11 > f) < 2 exp 
= 2 exp 
< 2 exp 


2t^Nl 


54 IU/32(^-*) 


/ 2A|fV4 \ 




(32) 


since < Np, for /3 G [0,1]. Similarly, 

P(|(C/3)i,m| >t) < 2exp . 

By adopting the partitioning technique of Theorem 4 in (26l, 
we also get for m m'\ 

>t)< 4exp ) • 
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Let Bq ;= max{5^, -fr}- Now, using the union bound we 
have: 


M 


P U ||(C/)W 

\m,m' = l '' 

where we have used (?)< 



< 2M^exp - 


32Bnr^L^ 




and. 


M 


\m=l 


p U 



< 2M exp (— 


NpS^a'^ 
ABor^L^ 


Now, by invoking the Gershgorin’s disc theorem, the eigneval- 
ues of any sub-matrix of C /3 restricted to an index set S with 
ISj = rL, lie in the interval [1 — 5,1 + 5] with probability at 
least: 


1 - 2M^ exp - 


NpS^a‘^ \ 
32Bor^L^ ) 


f NgS'^a'^ \ 


> 1 - SM^ exp - 


iV/3(5V2 \ 
32Bnr^L^ j 


(33) 


Hence, by choosing Np > logM, the prob¬ 

ability above is greater than 1 — 

Next, by invoking Lemma 4.1 (ii) of OTI . we have that C /3 
satisfies the RSC condition over the set given by Eq. (127] ) with 
a constant given by: 

2 


ko = {1-S){1-3 


i+s 


(r-l)(l-(5) 


(34) 


Hence, the negative log-likelihood satisfies the RSC with a 
constant given by Pmin(l - Pmax)o-^iV/ 3 Ko. Finally, by taking 
K > , we have that > 2 {^i 3 ) ’ which makes k 

independent of K and 13, given by: 


Pmin(l - 7»max)cr^/^0hL 


(35) 


and /3 > 1 — 


c' 

L'^ log M 


with C : = 


W5^ 

QABocF'^r'^{d+2) ' 


Next, the result of Theorem 1 of ifTSi implies: 


I- II / 27 a/L 

\(jj — cu 2 S-1“ 


27 a-L(u;) 


(36) 


for 7 > 2||V£^(cu)||oo- We have, for m 7 ^ 1, 


Proposition 1: Consider a sequence of cr-fields Bq <Z <Z 
■■■. Suppose that Xi is -measurable with \Xi\ < Bi for 
some constant Bi, i = 1, 2, • • • and ¥.{Xi\Bi-i} = 0. Then 
for all f > 0 , 

p(|:X.>t)<exp(-^). 

Proof: This result is a special case of Theorem 2.5 of 
ll^ for bounded and possibly dependent random variables, 
which generalizes Hoeffding’s inequality. ■ 

In our case, we can similarly show that 
E [st {nt - At A} I = E [s^E [{rit - At A} \Bt-i, Bt]] = 
0. Moreover, each summand is bounded by 2j3^~^B. Hence, 
using the result of Proposition [T] by taking n = TW and 
Xi = Si[ni - At A], we get: 

P(|(V£^(.)) J > tN,) < 2exp 

<2exp(-A^y 

Using the union bound, we have: 

P(||V£'’(a.)||^ > tN^) < 2Afexp • (37) 

By choosing t = j^^ve that 

||V£^(a;)||^ < ^2(d + l)A/ 3 logM, (38) 
with probability at least 1 — 

Hence, for a fixed () < 1, d > 0, and r > 1 + , 

by taking /3 > 1 - with (U' := Q^s^XHd+ 2 ) 

7 > C ”with C" := Y^4(d + 1)IU, any maximizer cD 
satisfies: 

||d)-a;||2<C'V(l-/3)LlogM + VC'cjL(u;)^(l-/3)LlogM, 
with probability at least 1 — — ■^, where C is given by 

_ 7l28(d+l) _ 

VH7„|„(1 -p„„)cr2(l - A) (l _ 3^ (r-17-i j) 


K W 

^S(i-l)W+j+m-M+l 

i=l j=l 
{n(i 

Now, let Bt be the cr-field generated by • • • , St, i.e., 

a{s-M+i, • • • , St). We have that 

E [{nt - At A} St] = E [E [{nt - At A} St\Bt]] 

= E [stE [{nt - At A} | J^t]] 

= E[gtE[ {AtA-AtA}J J-t]j = 0 . 

=0 

Hence for all m, E {(V£^(u;))^} = 0. We next invoke 
the following result for concentration of dependent random 
variables: 


Appendix B 

The Proximal Gradient Algorithm 

In this appendix, we give an overview of the proximal 
gradient algorithm for minimization of convex function. The 
corresponding algorithm for maximization of concave func¬ 
tions can be obtained by negating the objective functions. 
Consider the general optimization problem 

min /(x)+^(x), (39) 

X 

where functions /(x) : -P- M and g{x) : > R U {c«} 

are assumed to be closed proper convex functions. Suppose 
that / is differentiable with a Lipschitz continuous gradient 
V/ with constant L(V/). The function g can be possibly 
non-smooth. A wide range of practical optimization problems 
can be cast in this form, particularly in the context of machine 
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learning ||3^ . where the objective function can be decomposed 
into a loss function and a regularization term. 

The proximal gradient algorithm provides an iterative pro¬ 
cedure for solving (l39l ) in the following form: 

[x(^) - V/(x(^))], (40) 

where the parameter is an appropriately chosen step size 
at iteration i so that aW < ^ 

, and the proximal operator 
Vagi-) of function g with parameter a is defined as 

Pag(x) := argmin |^(u) + ^l|u - xH^}. (41) 

Among the several interpretations available for the proximal 
gradient method, we have adopt a quadratic approximation- 
based model to derive the main iterative scheme in (EOI). 
This interpretation ED, is based on the Majorization- 
Minimization algorithm (see for a detailed discussion). 
In the approximation-based derivation, the £-th iteration for 
solving the general problem (|3^ can be written in the follow¬ 
ing form: 

x(^“i“^) = argmin /q:(x, x*-^^) -|- 5 '(x), (42) 

U 

where the original objective function / is replaced with 
a quadratically-regularized linear approximation around the 
previous iterate x^^^, given by 

/o(x, y) := /(y) + V/(y)'(x - y) + T ||x - y|||, (43) 

2a 

where the quadratic term is referred to as the trust region 
penalty. Modulo constants, the objective function in (|4^ can 
be rearranged to get the proximal gradient form 

x(^"i"^) = argmin| 5 '(x) -|- — ||x — x*-^^ -|- aV/(x*'^^)|| 2 | 

= P«3[x(^)-aV/(xW)]. (44) 

The proximal operator often admits closed form expres¬ 
sions. As for -regularization, the proximal operator takes 
the simple form of the soft thresholding shrinkage operator 
^a||.||i =■ <Sa whose zth component is given by 

{Saix))i := sgnixi){xi - a)+, (45) 


with sgn denoting the standard signum function, and (a)+ : = 
max{o, 0}. In this case, the proximal algorithm leads to a fam¬ 
ily of algorithms referred to as iterative shrinkage algorithms 
Ell-Ell, where each iteration involves a simple gradient 
descent step followed by a shrinkage operation. 

Finally, in our setting, the function / is taken to be 
the exponentially weighted log-likelihood Due to the 

smoothness of the logistic function, the Lipschitz constant 
for VC^iuJk) can be upper bounded by the trace of the 
Hessian Bfc(a^fc) given in Eq. (1241) . Noting that the elements 
of Ai are at most equal to 1/4, we get LiVC^{uJk)) < 
i ELi Y. 7=i Using assumption (1) of Sec¬ 

tion |IlFB] and an application of Hoeffding’s inequality, shows 
that the sum is concentrated around its mean given by , 

for large enough k. Therefore, we choose the step size 
a = 


_ (1-/3) 

cMWa^ 


, for some constant c > 1/4. 


Appendix C 

Computation of Confidence Intervals 

The -regularized ML estimate of Eq. (fTTI) can be written 
in the following form 

cDfc = argmax{qi/3£(a;fc) - 7||a;fc||i} , (46) 

where Ciuj) := logp(n|X, cu) denotes the log-likelihood 
function over a generic window with spiking vector n, 
data matrix X and parameter vector a;, and the operator 
)p^/(n, X, a;) is defined for a function / : {0,1}^ x 
as the empirical expectation exponentially 
weighted with a forgetting factor /3: 

k 

= (47) 

i=l 

where we have suppressed the dependence of / on n and X 
on the left hand side for notational simplicity. Eollowing the 
treatment of Theorem 3.1 of ||3^ . the corresponding empirical 
gradient vector and Hessian are respectively given by: 

k 

gfc(a;fc) := %V£(a;fe) = /3^-'X'£,(a;fe), (48) 

i=\ 

k 

Bkicok) := ^fiV^VicOk) = -J2 X'A,(a;fc)X,. (49) 

1 = 1 

The KKT conditions for the estimator ujk can be then written 
as: 

gfc(cDfc) - 7Sfc = 0, ||sfc||oo < 1- (50) 

where Sk C || i is a subgradient vector from the subdiffer¬ 
ential of the ii norm, with components {sk)m = sgn ((a)fc)m) 
for (a)fc)m / 0 and |(sfe).^| < 1 otherwise, for m = 
1,2,..., M. Substituting ^,g£(a;/c) by its quadratic approxi¬ 
mation around the true parameter vector tJk, and inverting the 
corresponding KKT conditions, the ’de-sparsified’ estimator 
Wfc can be obtained as: 

Wfe := cDfe - ©fcg/c(u)fc), (51) 

where the matrix ©fe is the approximate inverse of Hessian 
matrix Bfc(a;fc), and can be computed using the following 
node wise regression procedure ll^ . To compute the mth row 
of ©fc, first the solution to the following LASSO problem is 
obtained: 

■= argmin (-2(Bfc)m,\m'0 + ^'(Bfc)\m,\m^ + 27 m||^/’||i), 

where the dependence of B/. on cDfc is suppressed for nota¬ 
tional convenience, and the subscript notations are the same 
as those described in the footnote of Algorithmic Then, we 
define the vector c G as: 

(c)^ = 1, (c)\^ = (52) 

and the scaling constant as 

'^m •“ (Bfc)m,m “ 'iprn (^k)m\m- (^5) 

Einally, the mth row of ©fc is given by {&k)m -= ■:^c. The 
variance and the confidence interval at a level of a for the 
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mth component of cDfe can then be computed as given in lines 
9 and the output of Algorithm [3] ll3^ . where 

k 

i=l 

Using Taylor expansion similar to that in the development 
of £i-PPFi, the matrix Gfc(u)/c) can be recursively updated 
as given in line 2 of Algorithm [3 Finally, the node wise 
regression can be recursively computed using the SPARLS 
algorithm ESI, which is given in lines 3-5 of Algorithm [3] 
The parameter 7 ^, can be chosen to be in the same order of 
7 in (fTTI) . 
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